Corpus-Based Arabic Stemming Using N-Grams
Abstract
In highly inflected languages such as Arabic, stemming improves text retrieval performance by reducing word variants. We propose a modification of the corpus-based stemming approach proposed by Xu and Croft for English and Spanish in order to stem Arabic words. In the first stage, we generate the conflation classes by clustering 3-gram representations of the words found in only 10% of the data. In the second stage, these clusters are refined using different similarity measures and thresholds. We conducted retrieval experiments using raw data, the Light-10 stemmer, and 8 different variations of the similarity measures and thresholds, and compared the results. The experiments show that 3-gram stemming using the Dice distance for clustering and the EM similarity measure for refinement performs better than no stemming, but slightly worse than the Light-10 stemmer. Our method could potentially outperform the Light-10 stemmer if more text is sampled in the first stage.
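To make the first stage more concrete, the following is a minimal sketch of grouping words by the Dice distance between their character 3-gram sets. It is an illustration only, not the authors' implementation: the function names, the unpadded 3-gram extraction, the single-link grouping strategy, and the 0.5 threshold are all assumptions, and the second-stage refinement is not shown.

```python
def char_ngrams(word, n=3):
    """Return the set of character n-grams of a word (no padding; an assumption)."""
    if len(word) < n:
        return {word}
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def dice_distance(a, b, n=3):
    """1 - Dice coefficient of the two words' n-gram sets."""
    x, y = char_ngrams(a, n), char_ngrams(b, n)
    return 1.0 - 2 * len(x & y) / (len(x) + len(y))


def cluster_words(words, threshold=0.5, n=3):
    """Single-link grouping: words closer than `threshold` share a class.

    The threshold and the single-link strategy are hypothetical choices
    made for this sketch.
    """
    words = list(dict.fromkeys(words))  # drop duplicates, keep order
    parent = {w: w for w in words}

    def find(w):
        # Union-find lookup with path halving.
        while parent[w] != w:
            parent[w] = parent[parent[w]]
            w = parent[w]
        return w

    for i, a in enumerate(words):
        for b in words[i + 1:]:
            if dice_distance(a, b, n) <= threshold:
                parent[find(a)] = find(b)

    classes = {}
    for w in words:
        classes.setdefault(find(w), set()).add(w)
    return list(classes.values())


if __name__ == "__main__":
    # Two forms sharing the 3-grams of "كتاب", plus one unrelated word.
    for cls in cluster_words(["كتاب", "كتابة", "مدرسة"]):
        print(sorted(cls))
```

The pairwise loop is quadratic in the vocabulary size, which is one reason a sketch like this would only be practical on a sample of the collection.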
Related resources
JHU/APL Experiments in Tokenization and Non-Word Translation
In the past we have conducted experiments that investigate the benefits and peculiarities attendant to alternative methods for tokenization, particularly overlapping character n-grams. This year we continued this line of work and report new findings reaffirming that the judicious use of n-grams can lead to performance surpassing that of word-based tokenization. In particular we examined: the re...
Dependency vs. Constituent Based Syntactic N-Grams in Text Similarity Measures for Paraphrase Recognition
Paraphrase recognition consists in detecting whether an expression restated as another expression contains the same information. Traditionally, several lexical, syntactic, and semantic techniques are used to solve this problem. For measuring word overlap, most works use n-grams; however, syntactic n-grams have been scarcely explored. We propose using syntactic dependency and...
Classical Arabic Poetry Categorization Using N-gram Frequency Statistics
Most of the Arabic vocabulary is built by derivation from roots. These roots are words composed of three to five consonant letters. Any information retrieval work on Arabic first needs to deal with the language's morphological and structural changes (the stemming process); then a statistical method for extracting information is implemen...
Capturing Out-of-Vocabulary Words in Arabic Text
The increasing flow of information between languages has led to a rise in the frequency of non-native or loan words, where terms of one language appear transliterated in another. Dealing with such out-of-vocabulary words is essential for successful cross-lingual information retrieval. For example, techniques such as stemming should not be applied indiscriminately to all words in a collection, a...
Recherche d'information dans un corpus bruité (OCR)
This paper evaluates the degradation in retrieval effectiveness when faced with a noisy text corpus. Using a test collection with the clean text, another version with an error rate of around 5% in recognition, and a third with a 20% error rate, we have evaluated six IR models based on three text representations (bag-of-words, n-grams, trunc-n) as well as three stemmers. Using the mean reciprocal...